========================================================
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Quality ranges from 3 (lowest) to 8 (highest). Grades 5 and 6 are by far the most common, with 681 and 638 wines, followed by grade 7 with 199.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol ranges from 8.40 to 14.90, with the peak count at around 9.5. The distribution is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density ranges from 0.9901 to 1.0037, with the peak count at around 0.9975. The distribution is roughly normal.
The dataset contains 1599 observations of 13 variables.
I want to investigate which chemical properties influence the quality of red wines, so quality is the main feature of interest. The other chemical properties are the features I will examine against it.
investigation into your feature(s) of interest?
I expect most of the chemical properties to affect quality, or at least to correlate weakly with it.
I have not created any new variables yet. In the next section I will compute the conditional means of quality for the other chemical properties.
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
Alcohol has a right-skewed distribution.
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
A scatterplot of fixed.acidity against quality shows no strong relationship between them, judging by the regression line and the correlation of 0.12. Because quality is recorded as an integer, the plot also does not show a continuous change. It is therefore better to compute the conditional mean of quality for each value of fixed.acidity and plot it with geom_line.
The next plot shows the mean quality for each distinct value of fixed.acidity.
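The conditional means described above can be computed with base R's `aggregate`. The following is a minimal sketch rather than the original analysis code: it uses only the first ten rows shown in the `str()` output, and the names `wine_head` and `quality_by_acidity` are placeholders chosen for illustration.

```r
# First ten rows of two columns, copied from the str() output above.
# `wine_head` is an illustrative name, not the full dataset.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean quality for each distinct fixed.acidity value.
quality_by_acidity <- aggregate(quality ~ fixed.acidity, data = wine_head, FUN = mean)
names(quality_by_acidity)[2] <- "quality_mean"

# Number of wines behind each mean, computed with the same grouping.
quality_by_acidity$n <- aggregate(quality ~ fixed.acidity, data = wine_head,
                                  FUN = length)$quality

print(quality_by_acidity)

# Line plot of the conditional means, built only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(quality_by_acidity,
                       ggplot2::aes(x = fixed.acidity, y = quality_mean)) +
    ggplot2::geom_line()
}
```

Keeping the group size `n` next to each mean helps when reading the line plot, since a mean based on a single wine is much noisier than one based on many.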
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
The scatterplot of density against quality shows a weak negative correlation (r = -0.17).
The scatterplot of the conditional means of quality by density also shows a weak negative correlation. The loess curve shows a weak negative trend at first and then, above a density of about 0.9975, a slightly positive one.
##
## Pearson's product-moment correlation
##
## data: wine$pH and wine$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
pH and quality have a very weak negative correlation (r = -0.06).
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
With a correlation coefficient of 0.48, there appears to be a medium positive correlation between alcohol and quality. The next step checks the conditional means of quality by alcohol.
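The test output in this section has the form produced by `cor.test`, and the `data:` lines suggest calls such as `cor.test(wine$alcohol, wine$quality)`. As a hedged illustration, the sketch below runs the same kind of test on only the first ten rows from the `str()` output, so its numbers will not match the full-data results; `wine_head` is a placeholder name.

```r
# First ten rows of two columns, copied from the str() output.
# `wine_head` is an illustrative name, not the full dataset.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson's product-moment correlation test.
result <- cor.test(wine_head$alcohol, wine_head$quality, method = "pearson")

r  <- unname(result$estimate)   # sample correlation ("cor" in the printout)
ci <- result$conf.int           # 95 percent confidence interval
p  <- result$p.value            # p-value for H0: true correlation is 0
df <- unname(result$parameter)  # degrees of freedom, n - 2

print(result)
```

Pulling `estimate`, `conf.int`, and `p.value` out of the result makes it easier to tabulate several features side by side instead of reading each printout separately.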
##
## Pearson's product-moment correlation
##
## data: quality_alcohol$quality_mean and quality_alcohol$alcohol
## t = 5.9481, df = 63, p-value = 1.301e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4167388 0.7359429
## sample estimates:
## cor
## 0.5996846
The graph shows a medium positive correlation between the two variables (r = 0.60 on the conditional means). However, the loess curve reverses direction above an alcohol value of 14, which is mainly caused by outliers. The next step removes those outliers and checks the loess curve again.
After removing those outliers, the loess curve no longer shows a negative trend. Alcohol therefore appears to have a medium positive effect on quality, although a very high alcohol concentration, such as above 14, may lower quality slightly.
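The outlier removal and the refitted loess curve can be sketched as follows. The report does not show how the outliers were removed, so filtering at an alcohol value of 14 is an assumption taken from the text, and both helper functions are hypothetical names. The column names `alcohol` and `quality_mean` follow the `quality_alcohol` data frame named in the test output.

```r
# Hypothetical helper: keep only rows whose `column` value is at or below
# `cutoff`. Rows with a missing value in `column` are dropped as well.
drop_above <- function(df, column, cutoff) {
  df[which(df[[column]] <= cutoff), , drop = FALSE]
}

# Hypothetical helper: fit a loess curve of mean quality on alcohol and
# return the fitted values next to the alcohol values they belong to.
fit_quality_loess <- function(df) {
  fit <- loess(quality_mean ~ alcohol, data = df)
  data.frame(alcohol = df$alcohol, fitted = predict(fit))
}

# Usage, assuming `quality_alcohol` holds the conditional means:
# trimmed <- drop_above(quality_alcohol, "alcohol", 14)
# curve   <- fit_quality_loess(trimmed)
```

Comparing the fitted curve before and after trimming shows directly whether the reversal at the high end depends on those few points.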
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulphates and quality have a weak positive correlation (r = 0.25).
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
volatile.acidity and quality have a medium negative correlation (r = -0.39). The next step checks the conditional means of quality by volatile.acidity.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality_mean
## t = -12.49, df = 141, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7943848 -0.6362877
## sample estimates:
## cor
## -0.7247403
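The conditional means show the same negative relationship, and the correlation is stronger (r = -0.72), as expected once averaging removes the scatter within each volatile.acidity value.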
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Citric acid and quality have a weak positive correlation (r = 0.23).
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
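Residual sugar shows almost no correlation with quality (r = 0.01); with p = 0.58 it is not statistically significant.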
##
## Pearson's product-moment correlation
##
## data: wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
Chlorides and quality have a weak negative correlation (r = -0.13).
##
## Pearson's product-moment correlation
##
## data: wine$free.sulfur.dioxide and wine$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
Free sulfur dioxide and quality have a very weak negative correlation (r = -0.05).
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
Total sulfur dioxide and quality have a weak negative correlation (r = -0.19).
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulphates and quality have a weak positive correlation. However, the loess curve shows a slightly negative trend above a sulphates value of about 0.75, perhaps because there are too few observations in that range.
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and free.sulfur.dioxide
## t = 36.341, df = 1595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6452693 0.6989950
## sample estimates:
## cor
## 0.673019
There is a medium positive correlation (r = 0.67) between total sulfur dioxide and free sulfur dioxide, probably because total sulfur dioxide includes free sulfur dioxide; only the proportion differs between wines.
##
## Pearson's product-moment correlation
##
## data: wine$pH and wine$density
## t = -14.53, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3842835 -0.2976642
## sample estimates:
## cor
## -0.3416993
There is a medium negative correlation between pH and density (r = -0.34).
## Low Medium High
## 63 1319 217
I divide the wines into three groups by quality: grades 3 and 4 as “low”, grades 5 and 6 as “medium”, and grades 7 and 8 as “high”. Since alcohol and volatile acidity are the two features that influence quality the most, I analyze the distributions of these two.
The higher the quality, the higher the alcohol: the center of the distribution moves to the right.
The higher the quality, the lower the volatile.acidity: the center of the distribution moves to the left.
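One way to build this grouping is base R's `cut`. The sketch below is an illustration rather than the original code: it rebuilds the quality scores from the frequency table earlier in the report (counts for grades 3 to 8), and the variable names `quality` and `grade` are chosen for the example.

```r
# Quality scores rebuilt from the frequency table of grades 3 to 8.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed by default: (2,4], (4,6], (6,8].
grade <- cut(quality,
             breaks = c(2, 4, 6, 8),
             labels = c("Low", "Medium", "High"))

print(table(grade))
# Low: 63, Medium: 1319, High: 217
```

These counts match the Low/Medium/High table above, which suggests the same boundaries were used.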
investigation. How did the feature(s) of interest vary with other features in
the dataset?
Positive correlation with quality:

1. fixed.acidity: cor 0.1240516 (weak)
2. sulphates: cor 0.2513971 (weak)
3. citric.acid: cor 0.2263725 (weak)
4. residual.sugar: cor 0.01373164 (weak)
5. alcohol: cor 0.4761663 (medium)

Negative correlation with quality:

1. density: cor -0.1749192 (weak)
2. pH: cor -0.05773139 (weak)
3. volatile.acidity: cor -0.3905578 (medium)
4. chlorides: cor -0.1289066 (weak)
5. free.sulfur.dioxide: cor -0.05065606 (weak)
6. total.sulfur.dioxide: cor -0.1851003 (weak)
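A summary like this can be produced in one pass instead of one test per feature. The helper below is a hypothetical sketch: it assumes the data frame is called `wine` (as the `data:` lines in the test output suggest) and that `X` is only a row index, so it is left out.

```r
# Hypothetical helper: Pearson correlation of every feature with the target,
# sorted from most positive to most negative. `exclude` lists columns to skip.
cor_with_quality <- function(df, target = "quality", exclude = "X") {
  features <- setdiff(names(df), c(target, exclude))
  r <- vapply(features,
              function(f) cor(df[[f]], df[[target]]),
              numeric(1))
  sort(r, decreasing = TRUE)
}

# Usage, assuming the full dataset is loaded as `wine`:
# round(cor_with_quality(wine), 3)
```

Sorting the result puts alcohol and volatile.acidity at the two ends of the vector, which is the ordering the two lists above express.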
(not the main feature(s) of interest)?
Free sulfur dioxide and total sulfur dioxide have a medium positive correlation, mainly because total sulfur dioxide includes free sulfur dioxide.
pH and density have a medium negative correlation.
Among the relationships with the main feature, the strongest is between alcohol and quality (r = 0.48). Among the other features, the strongest is between free sulfur dioxide and total sulfur dioxide (r = 0.67).
We can add another dimension to the graph using color, since alcohol and volatile.acidity are the two features that affect quality the most. Because quality takes discrete values, the plot is somewhat over-plotted.
A jitter plot reduces the over-plotting. It shows that higher-quality wines tend to have higher alcohol and lower volatile.acidity.
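A jittered, color-coded plot of this kind might be built as sketched here. The report does not state which variable sits on which axis, so placing alcohol on x, volatile.acidity on y, and quality on color is an assumption, and `plot_alcohol_acidity` is a hypothetical helper name. The ggplot2 package must be installed to call it.

```r
# Hypothetical helper: jittered scatterplot of volatile.acidity against
# alcohol, colored by quality. The axis assignment is an assumption.
plot_alcohol_acidity <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = alcohol,
                                   y = volatile.acidity,
                                   colour = factor(quality))) +
    ggplot2::geom_jitter(alpha = 0.5) +  # jitter and transparency reduce over-plotting
    ggplot2::labs(colour = "quality")
}

# Usage, assuming the full dataset is loaded as `wine`:
# print(plot_alcohol_acidity(wine))
```

Treating quality as a factor gives each grade its own discrete color instead of a continuous gradient, which suits a variable with only six values.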
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
The plot shows that wines with higher alcohol tend to have higher quality and wines with higher volatile.acidity tend to have lower quality. Alcohol and volatile acidity have a relatively weak negative correlation (r = -0.20).
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Alcohol and density have a medium negative correlation (r = -0.50), and the negative correlation is stronger for higher-quality wines.
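The claim that the correlation strengthens with quality can be checked numerically by computing the correlation within each group. The helper below is a hypothetical sketch; whether the original comparison grouped by quality score or by the low/medium/high grade is not stated, so the grouping column is left as a parameter.

```r
# Hypothetical helper: Pearson correlation between columns `x` and `y`,
# computed separately for each level of `group`. Empty groups are dropped.
cor_by_group <- function(df, x, y, group) {
  pieces <- split(df, df[[group]], drop = TRUE)
  vapply(pieces, function(d) cor(d[[x]], d[[y]]), numeric(1))
}

# Usage, assuming the full dataset is loaded as `wine`:
# cor_by_group(wine, "alcohol", "density", "quality")
```

Groups with very few wines, such as grade 3, give unstable correlations, so the per-group values are best read together with the group sizes.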
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
For the top two features affecting quality, wines with higher alcohol tend to have higher quality, and wines with higher volatile acidity tend to have lower quality.
Wines with higher alcohol tend to have lower density. The higher the quality, the stronger the negative correlation between the two.
This graph shows the distribution of wine grades. Grades 5 and 6 have the highest counts, both above 600 in this sample. Grade 3 has the lowest count.
Alcohol and wine quality have a medium positive correlation, the highest among all the features.
This graph shows the relationship between alcohol and volatile acidity, and the relationship of each with quality. Alcohol and volatile acidity are negatively correlated; wines with higher alcohol tend to have higher quality, and wines with higher volatile.acidity tend to have lower quality.
The original dataset contains 1599 observations and 11 chemical features. I am interested in which chemical features affect wine quality the most. The first is alcohol: wines with a high alcohol concentration tend to have high quality. The second is volatile acidity, which affects quality in the opposite direction. The other features do not correlate strongly with quality. Surprisingly, alcohol and density have a medium negative correlation, and that correlation is stronger for higher-quality wines. Future work could build models of these chemical features, such as linear regression and decision trees.